ARA: a flexible pipeline for automated exploration of NCBI SRA datasets

Abstract Background One of the most effective and useful methods to explore the content of biological databases is searching with nucleotide or protein sequences as a query. However, especially in the case of nucleic acids, due to the large volume of data generated by the next-generation sequencing (NGS) technologies, this approach is often not available. The hierarchical organization of the NGS records is primarily designed for browsing or text-based searches of the information provided in metadata-related keywords, limiting the efficiency of database exploration. Findings We developed an automated pipeline that incorporates the well-established NGS data-processing tools and procedures to allow easy and effective sampling of the NCBI SRA database records. Given a file with query nucleotide sequences, our tool estimates the matching content of SRA accessions by probing only a user-defined fraction of a record's sequences. Based on the selected parameters, it allows performing a full mapping experiment with records that meet the required criteria. The pipeline is designed to be easy to operate—it offers a fully automatic setup procedure and is fixed on tested supporting tools. The modular design and implemented usage modes allow a user to scale up the analyses into complex computational infrastructure. Conclusions We present an easy-to-operate and automated tool that expands the way a user can access and explore the information contained within the records deposited in the NCBI SRA database.

One of the most effective and useful methods to explore the content of biological databases is searching with nucleotide or protein sequences as a query. However, especially in the case of nucleic acids, due to the large volume of data generated by the Next Generation Sequencing technologies, this approach is often not available. The hierarchical organization of the NGS records is primarily designed for browsing or text-based searches of the information provided in metadata-related keywords, limiting the efficiency of database exploration.

Findings
We developed an automated pipeline that incorporates the well-established NGS data processing tools and procedures to allow easy and effective sampling of the NCBI SRA database records. Given a file with query nucleotide sequences, our tool estimates the matching content of SRA accessions by probing only a user-defined fraction of a record's sequences.
Based on the selected parameters, it allows performing a full mapping experiment with records that meet the required criteria. The pipeline is designed to be easy to operateit offers a fully automatic setup procedure and is fixed on tested supporting tools. The modular design and implemented usage modes allow a user to scale up the analyses into complex computational infrastructure.

Conclusions
We present an easy-to-operate and automated tool that expands the way a user can access and explore the information contained within the records deposited in the NCBI SRA database.
Keywords: SRA database, NGS data, database searching, sequence analysis

Background
The development of new computational tools dedicated to exploring the content of the protein and nucleotide sequence repositories revolutionized the methods of biological data access.
FASTA [1] and BLAST [2] algorithms enable fast and efficient search of large databases for similar sequences and a statistical evaluation of the results. The heuristic algorithms used by these tools are much faster than implementations of rigorous dynamic programming algorithms. These methods, however, have a limited application in the context of large data sets generated by high throughput Next Generation Sequencing (NGS) technologies. To meet the requirements of the NGS data processing, new approaches and specialized tools had to be developed.
The explosion of new techniques and algorithms for NGS sequence mapping led to the development of specialized solutions (e.g., BOWTIE [3] and BWA [4]) that can process sequence data more efficiently (in terms of required computational time) than the classical methods of sequence data exploration (e.g., BLAST and FASTA). However, although more effective, the new programs do not offer an easily accessible way for sequence-based searching of the NGS data repositories (e.g., Sequence Read Archive -SRA [5] or European Nucleotide Archive -ENA [6]). Due to the large volume of data and huge computational time required for sequence searches, exploration of the NGS databases is routinely restricted to text-based surveys with keywords and phrases provided in records' metadata. This is an obvious limitation since the user must depend on the completeness and correctness of the information provided in record descriptions.
The sequence-based screening of the whole SRA database records has been implemented recently in Magic-BLAST tool [7], designed specifically for mapping of large next-generation RNA or DNA sequencing runs. This functionality has been also incorporated, to some extent, into Bowtie2 since release 2.3.5 [8]. The availability of an option to automatically fetch SRA records is reported in prepackaged builds or alternatively, requires the source code compilation using the SRA software dependencies. A similar approach is offered by the sra-pipeline tool (https://github.com/FredHutch/sra-pipeline). Here, the program automates the process of downloading, mapping (BOWTIE2 [8]), compressing, and storing the results within the AWS cloud computing infrastructure. More advanced functionality, providing automated access to SRA records, is available in the pyrpipe Python package [9]. This tool helps, in a very convenient way, to incorporate the ability for automated RNA-seq data download into a custom pipeline. It also provides a programming interface for tools designed for basic sequence processing applications (e.g., adapter trimming). The BICF SRA Pipeline was created with a similar aim, which allows bulk download of data and quality assessment (https://doi.org/10.5281/zenodo.3739788). All of these currently available tools are designed to work by downloading the entire record data (FASTQ formatted files), which in most cases is very expensive in terms of computational resources and storage.
We have developed an integrated, feature-rich, universal, flexible, and ready-to-use solution that allows access to the data using both the SRA-tools and SRA-cloud sources. It provides a full or partial SRA record analysis mode and a choice of the sequence screening method (BLAST [2] and BOWTIE2 [8]) and taxonomic profiling (Kraken2 [10]). The modular design of the pipeline allows easy further expansion of the sequence analysis toolbox. The implemented procedure also provides basic quality checks, including the removal of adapters and filtering of reads. Along with sequence data processing the pipeline extracts metadata information for each of the SRA records. A simple configuration schema allows full control of the procedure workflow, including the option to resume interrupted analyses and effortless integration into the distributed computational infrastructure.

Implementation and tools
Similarly to Entrez Direct (command line utilities used to access in a text-based way NCBI databases) [11], the ARA (Automated SRA Records Analysis) tool is implemented in Perl and designed to be used from the shell prompt. It employs the NCBI SRA toolkit (https://github.com/ncbi/sra-tools) to download the raw data in FASTQ format from the SRA database. NCBI Entrez programming utilities provide access to the sample-level metadata along with the location of the raw data stored in the cloud. In case of a problem with fastqdump (SRA toolkit) tool-mediated data downloads, the pipeline is designed to use NCBI eutilities to find a link to the original SRA file (in the native SRA format) located either at AWS, GCP, or NCBI's storage servers. In such a case, the files are retrieved using the wget program and the NGS reads are extracted with fastq-dump. The fastq-dump supports downloading a fraction of reads from the beginning of a particular sequencing run accession. Using this tool speeds up downloading and analysis processes, which saves time and disk usage. A newer utility, fasterq-dump, from SRA toolkit, unfortunately, does not possess all the features offered by fastq-dump and is not currently implemented in the ARA pipeline. We will update our software once the developers of SRA toolkit implement the option of partial download in fasterq-dump.

Sequence quality evaluation
In order to ensure the highest quality data for processing, the downloaded sequences are subjected to a thorough three-step quality check procedure. FastQC [12] is executed on both raw and adapter-trimmed FASTQ files (Figure 1 -first and second quality check).
Trimmomatic [13] is used to filter low-quality sequences and/or remove adapters, which may profoundly influence the reliability of the alignment and other downstream analyses. The redundant reads are clustered using the Fastx toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) to reduce the database volume for sequence search and to speed up the analyses. The unique reads are mapped to the query sample using NCBI BLAST command line utility (https://www.ncbi.nlm.nih.gov/books/NBK279690/) and/or BOWTIE2 (Figure 1). In addition, the pipeline allows the taxonomic classification of reads using the 'Kraken2' program. By default, the tool uses a viral genomic reference dataset. However, it is possible to use any custom 'Kraken2' database by supplying the required data and modifying the configuration file.

Installation
The pipeline can be effortlessly set up using either Docker or Mamba package manager (https://mamba.readthedocs.io/en/latest/) which is a C++-based implementation of Conda (https://conda.io/). The Mamba implementation offers a faster interface and uses libsolv to effectively resolve the dependencies. The use of a package manager guarantees seamless installation of the required software tools along with their dependencies and has been successfully used in many recent bioinformatics projects (e.g., SCRAP [14], grenepipe [15], pyGenomeTracks [16], TransposonUltimate [17], matOptimize [18], plotsr [19], CRISPRtracrRNA [20]. The Perl modules are automatically downloaded and installed by the setup script through CPAN (https://metacpan.org/).

Workflow modes
The ARA pipeline allows analyses in two major modes: 'screening' and 'full'. Both modes include the same steps (Figure 1) that involve basic procedures in NGS sequence data analysis.
In the 'screening' mode the ARA tool will perform the analysis only on a small fraction (by default 5%) of sequences from each record. This step is useful in the selection of the best possible candidate datasets for further analyses. The 'full' mode allows downloading and processing of the entire sequence record. The 'both' analysis mode automates the two mentioned steps (first 'screening' and then 'full' analysis) (Figure 1). An additional 'summary' mode combines results of independently performed analyses (e.g., on computational grid structure) into one summary file that can be examined and used in subsequent steps (e.g., 'full' mode).

Figure 1. Graphic presentation of the ARA pipeline workflow: green arrows indicate steps run in the 'screen' mode. Blue arrows indicate steps executed in ''full' mode. Red arrows show the analysis path specific for the combined "both" mode (including automated generation of filtered SRA records list -depicted by a red rectangle). Yellow rectangles indicate user input data (query sequences and list of SRA records' accessions). White rectangles represent distinct steps in the analysis process (with tools indicated in brackets) and output reports generated by the pipeline. 'Sequence analyses' currently allow mapping and taxonomic classification, but can be easily expanded to include more tools (depicted by ellipsis in square brackets).
On each SRA 'Run ID' listed in the input file, the ARA pipeline performs data retrieval, adapter trimming, quality check, and mapping of reads using the reference sequences (Figure 1).

Features comparison
One of the most important features ( Table 1) of the ARA pipeline is its modular architecture.
Such a design allows an easy expansion of the available toolbox for downstream sequence analyses. In order to demonstrate this functionality, besides sequence mapping tools, we incorporated a module for the taxonomic classification of the reads using Kraken2 [21]. The pipeline generates a classification report for each sample using a provided reference database (viral genomes by default).
In case of problems with downloading the records, the pipeline attempts to retrieve the raw FASTQ file thrice before searching with NCBI e-utilities for alternate SRA file locations in the cloud storage. After successful retrieval of the SRA record, it extracts the single or paired-end reads using the fastq-dump utility (SRA toolkit). If both methods to download the data fail, the ARA pipeline logs the event and skips that particular SRA run accession. Following the data download step, the ARA tool performs quality checks to ensure that the absence of matching sequences is not related to sequencing errors. The parameters of each of the analysis steps can be tweaked through the ARA configuration file. A user is required to provide a list of SRA run accessions as the starting parameter for the analysis.

Output format
The results of each computational step are organized into distinct folders, optionally allowing easy manual exploration of the data. Each SRA accession's runand sample-level attributes (e.g., run accession number, run type, total spots, study accession number, title, abstract, etc) are stored in dedicated files in the output directory. Upon completion of the analysis, the pipeline also generates a combined summary file sorted by the overall alignment percentage in decreasing order (Supplementary Table 1) along with the metadata for every SRA accession.
This way of ranking the analyzed SRA records should help a user to select samples that most accurately reflect the sequences of interest for other downstream analyses. The summary file can also be regenerated by executing the analysis in the "summary" mode. The ARA tool is based on commonly used and tested programs, hence its performance solely depends on the hardware specifications and the quality of the network connection.

Examples of applications
Applying the 'screen' mode to a set of selected accessions is the simplest way to identify samples (regardless of their record annotation) that contain sequences of interest. Therefore, while simultaneously saving time and storage space, working with a fraction of the data in  Supplementary Figures 1 and 2).
One of the major factors influencing the efficiency of the ARA pipeline 'screen' mode is the fraction of the record that will be downloaded and analyzed. Although this parameter depends on the type of used query sequences and the specific characteristics of the selected NGS records, we attempted to calculate the false negative rate using our tRNA example and expanded set of SRA records (205 samples). Our results show (Supplementary File 1) that even at the lowest threshold (5%) only a small fraction (FNR = 0.073) did not pass the criterion of 1% tRNA sequence content. All the misclassified samples came from the same experiment (SRA ID SRP267192) and demonstrate the need for careful selection of the screening subset size depending on the project goals.

Conclusions
The ARA pipeline offers an easy and flexible way to explore data stored in SRA database records. It uses well-established tools in the NGS data processing and offers complete control over all the steps of the analysis. The pipeline provides a user-friendly and automated interface, starting from installation and ending with the creation of the combined summary file.
Compared to other currently available tools (Table 1), the ARA tool incorporates a wider spectrum of possible analyses and options. The sample case of screening the RNA-seq records for tRNA transcripts demonstrates that the application of the ARA pipeline allows exploration of the SRA data beyond the information provided in the records' metadata.

Availability and Requirements
Project name: ARA pipeline Bowtie2 and BLASTn for each tested SRA sample.